Programming for Data Analysis Project 2020

Table of contents

  1. Introduction

  2. Factors

    2.1 Income

    2.2 Cognitive Ability

    2.3 Big Five Personality Traits

    2.4 Gender

    2.5 Education

  3. Simulation

    3.1. Correlation matrix

    3.2. Covariance Matrix

    3.3. Data Generation

    3.4. Classification of Income

    3.5. Classification of Education Level

    3.6. Classification of Gender

    3.7. Details on dataset

    3.8. Visualisation

    3.9. Output dataframe to csv

  4. References

1 Introduction

The purpose of this project is to simulate a dataset which represents income in the United States and the various factors that are associated with it. In this instance, the factors are cognitive ability and personality (the Big Five traits Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism). In addition, the relationship between gender, education level and income will be investigated.

The study of the distribution of income and the factors that correlate with it is of interest because of questions concerning income equality and how much of success is due to so-called "fair" factors such as personality and cognitive ability, and how much is due to "unfair" factors such as parental wealth, social status and gender privilege.

Each of the factors and their distribution are described as follows:

2 Factors

2.1 Income

The factor that is of primary concern in this simulation is total lifetime income in the United States. The aim of the project is to simulate a distribution of total lifetime income using the mean and standard deviation reported in the paper "Who Does Well in Life? Conscientious Adults Excel in Both Objective and Subjective Success" 1. Total lifetime income is defined as the cumulative income earned by an individual whose age is on average 68 years, with a range of 30-91 years (so on average lifetime income is cumulative income earned over 50 years, from 18 to 68 years of age).

Generally, income (including lifetime income) is best described using the lognormal distribution, as the mode tends to be less than the median, which is less than the mean (reflecting income inequality). However, because the earnings in the paper were capped at a taxable maximum, the data is not sufficiently skewed to justify the transformation and can be modelled as a normal distribution.

The mean and standard deviation of total lifetime income are 980,000 US dollars and 738,000 US dollars respectively.

NOTE: Negative income values are not omitted. This is because it is possible to have a net negative lifetime income if one is heavily in debt at the age of 68.
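
As a rough check on how often the normal model produces negative values, the sketch below uses the mean and standard deviation quoted above with the standard-library statistics.NormalDist class. The variable names are illustrative and are not taken from the project's code.

```python
from statistics import NormalDist

# Mean and standard deviation of total lifetime income (US dollars), from Section 2.1
income_mean = 980_000
income_sd = 738_000

income_dist = NormalDist(mu=income_mean, sigma=income_sd)

# Probability that a simulated lifetime income falls below zero
p_negative = income_dist.cdf(0)
print(f"Share of negative incomes under the normal model: {p_negative:.3f}")
```

Under these parameters roughly nine percent of simulated incomes are negative, which is why the treatment of negative values needs to be stated explicitly.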

2.2 Cognitive Ability

Cognitive ability as measured by the paper includes memory, vocabulary and numeracy. These measures (in particular vocabulary) correlate very highly with IQ 2. IQ is measured initially on an ordinal scale with percentiles but is approximated on the interval scale as a normal distribution or bell curve across a whole population 3. Cognitive ability correlates positively with income.

In the paper, the mean and standard deviation are 0 and 1. It is normally distributed like IQ (although IQ usually has mean 100 and standard deviation 15).
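
The relationship between the standardised scale used in the paper and the conventional IQ scale can be illustrated with a short sketch. The seed and sample size below are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

# Standardised cognitive ability scores (mean 0, standard deviation 1)
z_scores = rng.standard_normal(5)

# Rescale to the conventional IQ scale (mean 100, standard deviation 15)
iq_scores = 100 + 15 * z_scores
print(np.round(iq_scores, 1))
```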

2.3 Big Five Personality Traits

The Big Five Personality traits are an attempt by psychologists to encapsulate and quantify several personality traits 4. The traits are similar to IQ in that they are measured on the ordinal scale by psychologists but approximated on the interval scale as a normal distribution across a whole population 5. The traits are measured on a scale where 4 means an individual scores extremely high in a trait and 1 means they score extremely low in a trait.

The Big Five traits are Openness, Conscientiousness, Extraversion, Agreeableness and Emotional Stability/Neuroticism (the facets and domains of the Big Five are not considered here). The mean and standard deviations are drawn from the paper.

2.3.1 Openness

Openness measures the level of interest in art, intellectual pursuits and creativity in an individual. It also measures how unconventional and fantasy-prone they are. It correlates positively with income. The mean is 1.95 and the standard deviation is 0.55.

2.3.2 Conscientiousness

Conscientiousness measures how industrious, organised and self-disciplined an individual is. It also measures how cautious and dutiful they are. It correlates positively with income. The mean is 2.56 and the standard deviation is 0.48.

2.3.3 Extraversion

Extraversion measures how talkative, assertive and sensation-seeking an individual is. It also measures their level of activity and positive emotions. It correlates negatively with income. The mean is 2.2 and the standard deviation is 0.55.

2.3.4 Agreeableness

Agreeableness measures how modest, altruistic and honest an individual is. It also measures how compassionate and trusting they are. It correlates negatively with income. The mean is 2.53 and the standard deviation is 0.47.

2.3.5 Emotional Stability/Neuroticism

Emotional stability is the reverse of Neuroticism. Neuroticism measures how anxious, fearful and depressed an individual is. It also measures how self-conscious and impulsive they are. It correlates negatively with Income, whereas its reverse Emotional Stability correlates positively with income. The mean (of Emotional Stability) is 2.71 and the standard deviation is 0.61.

2.4 Gender

Gender, as defined in this project, is whether the individual identified as male or female. Since this project's purpose is to generate a dataset which correlates with total lifetime income for people who are on average 68 years old, the expectation is that men's lifetime income will dwarf women's lifetime income because for much of the 20th century men were the sole breadwinners of households, in the United States and in other industrialised nations 6.

In this project, gender will be represented as a categorical variable which can be either 'M' or 'F'. It will be generated by taking the income brackets and calculating, for each bracket, the probability that an individual in that bracket is male (and hence the probability that they are female). A 1 or 0 is then drawn from the binomial distribution with that probability, set to 'M' or 'F' respectively, and incorporated into the dataset as such.

2.5 Education

Education level is defined in this project as the highest level of education attained by an individual in their lifetime. There are five levels of American education described here: less than high school (LTHS), high school graduate (HSG), some college (SC), bachelor's degree only (BA) and graduate degree attainment (GRAD), taken from the paper "Education and Lifetime Earnings in the United States" 7. Education levels typically correlate positively with income 8.

In the paper, bar charts of education levels and gross lifetime earnings are presented for men and women. The values for earnings and education level are drawn from these bar charts. Education level is a categorical variable in the paper but for the sake of simplicity it will be changed to a numerical variable where 1 is the lowest level (LTHS) and 5 is the highest level (GRAD). The correlation between gross lifetime earnings and education is then calculated and the dataset representing the level of education is generated using that correlation.

Afterwards, when the values for Education level are appended to the dataset, the data is transformed back into a categorical variable.

3 Simulation

3.1 Correlation Matrix

To generate the correlation between Income and Education, data from the second paper referenced is used in conjunction with the module scipy.stats.stats to generate the correlation coefficient between lifetime earnings (Income) and highest level of education achieved (Education).
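
A minimal sketch of this calculation is given below. The earnings figures are hypothetical placeholders, not the values read from the paper's bar charts, and np.corrcoef is used here in place of the scipy routine; both compute the Pearson correlation coefficient.

```python
import numpy as np

# Education levels coded 1 (LTHS) to 5 (GRAD), as described in Section 2.5
education_level = np.array([1, 2, 3, 4, 5])

# Hypothetical lifetime earnings per level, in millions of US dollars
# (placeholders for illustration, NOT the paper's values)
earnings_millions = np.array([0.9, 1.3, 1.6, 2.2, 2.8])

# Pearson correlation coefficient between education level and earnings
r = np.corrcoef(education_level, earnings_millions)[0, 1]
print(f"Correlation between education and earnings: {r:.3f}")
```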

For correlations between educational level and the Big Five personality traits, the correlations between GPA (Grade Point Average) and the Big Five taken from "Personality Predictors of Academic Outcomes: Big Five Correlates of GPA and SAT Scores" 9 are used (this is a simplification as GPA or grade point average correlates highly with educational level but is not identical to educational level).

In addition, the correlation between GPA and IQ (and hence cognitive ability) is taken from "Personality and Intelligence Interact in the Prediction of Academic Achievement" 10.

Since all the factors correlate with income (and most of the factors also correlate with each other to some degree), the multivariate normal distribution is used to generate the dataset 11. The correlation matrix used in the multivariate normal distribution is entered using the correlations from the first paper as well as the correlation calculated above. The rest are taken from the second paper (the pretty_print_matrix function is taken from Stack Overflow 12).

The correlation matrix should be symmetric. A quick way to test this is by checking if it is equal to its transpose. This is accomplished using the numpy package and the T attribute of the array:
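
A minimal sketch of the check, assuming a small 3x3 matrix with placeholder correlations rather than the project's actual matrix:

```python
import numpy as np

# Illustrative correlation matrix; the off-diagonal values are placeholders
corr = np.array([
    [1.0, 0.3, 0.1],
    [0.3, 1.0, -0.2],
    [0.1, -0.2, 1.0],
])

# A matrix is symmetric if it equals its own transpose
is_symmetric = np.array_equal(corr, corr.T)
print(is_symmetric)  # True
```

If the entries were computed rather than typed in, np.allclose(corr, corr.T) would be the safer comparison because it tolerates floating-point rounding.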

The transpose of the matrix is equal to the matrix, therefore the matrix is symmetric.

3.2 Covariance Matrix

The correlation matrix is used to generate a covariance matrix 13.

The covariance matrix is a matrix of numbers which indicate how the variables in the dataset relate to each other. It is calculated using the correlation matrix and an array of standard deviations.

The code to do this is as follows:
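
A minimal sketch of the conversion, assuming three of the variables only. The standard deviations are those quoted in Section 2 for income, cognitive ability and conscientiousness; the correlations are placeholders. Each covariance is the correlation multiplied by the two corresponding standard deviations.

```python
import numpy as np

# Placeholder correlations between income, cognitive ability and conscientiousness
corr = np.array([
    [1.0, 0.3, 0.1],
    [0.3, 1.0, -0.2],
    [0.1, -0.2, 1.0],
])

# Standard deviations from Section 2: income, cognitive ability, conscientiousness
std_devs = np.array([738_000.0, 1.0, 0.48])

# cov[i, j] = corr[i, j] * std_devs[i] * std_devs[j]
cov = np.outer(std_devs, std_devs) * corr
print(cov)
```

The diagonal of the result holds the variances, so taking the square root of the diagonal recovers the original standard deviations.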

3.3 Data Generation

The dataframe is generated by taking the covariance matrix and list of means and converting them into numpy arrays. They are used in the np.random.multivariate_normal function to generate the dataset.

The dataset is then used as data for a pandas dataframe called df.
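
The generation step can be sketched as follows for three of the variables. The means and standard deviations are those given in Section 2; the correlations, the seed, the sample size of 1000 and the column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Means and standard deviations from Section 2:
# income, cognitive ability, conscientiousness
means = np.array([980_000.0, 0.0, 2.56])
std_devs = np.array([738_000.0, 1.0, 0.48])

# Placeholder correlations (not the values used in the project)
corr = np.array([
    [1.0, 0.3, 0.1],
    [0.3, 1.0, -0.2],
    [0.1, -0.2, 1.0],
])
cov = np.outer(std_devs, std_devs) * corr

np.random.seed(42)  # arbitrary seed so the sketch is reproducible
data = np.random.multivariate_normal(means, cov, size=1000)

# Wrap the simulated array in a dataframe with one column per variable
df = pd.DataFrame(data, columns=["Income", "Cognitive Ability", "Conscientiousness"])
print(df.shape)  # (1000, 3)
```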

3.4 Classification of Income

To classify each income range, a very simple classification scheme is used. Incomes above the 98th percentile are categorised as "H" for High. Incomes above the 84th percentile and at or below the 98th percentile are categorised as "HM" for High Middle. Incomes above the 50th percentile and at or below the 84th percentile are categorised as "LM". Incomes above the 16th percentile and at or below the 50th percentile are categorised as "HL" for High Low. Incomes above the 2nd percentile and at or below the 16th percentile are categorised as "LM" for Low Middle. Incomes at or below the 2nd percentile are categorised as "L" for Low.

This classification scheme is done as follows:
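
One way to implement a percentile-based scheme of this kind is with np.percentile and pd.cut, as sketched below. The labels "C1" (lowest) to "C6" (highest) are generic placeholders, and the column name "Income Class" and the simulated incomes are likewise assumptions for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed
df = pd.DataFrame({"Income": np.random.normal(980_000, 738_000, size=1000)})

# Bracket boundaries at the 2nd, 16th, 50th, 84th and 98th percentiles
cut_points = np.percentile(df["Income"], [2, 16, 50, 84, 98])
bins = [-np.inf, *cut_points, np.inf]

# Placeholder labels, lowest bracket first
labels = ["C1", "C2", "C3", "C4", "C5", "C6"]

# right=True makes each bracket "above the lower bound, at or below the upper bound"
df["Income Class"] = pd.cut(df["Income"], bins=bins, labels=labels, right=True)
print(df["Income Class"].value_counts().sort_index())
```

The percentiles 2, 16, 50, 84 and 98 correspond approximately to two and one standard deviations either side of the mean of a normal distribution, so the middle brackets are by construction the largest.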

3.5 Classification of Education

The education level was converted to a numerical variable in Section 3.1. In this section, it is converted back into the categorical variable as follows:

The "Education" column can now be dropped:
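
The conversion and the subsequent drop can be sketched as follows. The rule used to turn the simulated continuous values into the five levels (round to the nearest integer and clip to the range 1 to 5) is an assumption, as are the column name "Education Level" and the simulated values.

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # arbitrary seed
df = pd.DataFrame({"Education": np.random.normal(3, 1, size=10)})

# Numeric codes from Section 2.5 and their category names
level_names = {1: "LTHS", 2: "HSG", 3: "SC", 4: "BA", 5: "GRAD"}

# Assumed rule: round to the nearest level and keep within 1..5
numeric_level = df["Education"].round().clip(1, 5).astype(int)
df["Education Level"] = numeric_level.map(level_names)

# The numeric column is no longer needed
df = df.drop(columns=["Education"])
print(df.head())
```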

The income brackets defined in Section 3.4 can be used to estimate the probability of a particular row being male or female. This is done in the following section.

3.6 Classification of Gender

The first paper referenced previously is controlled for gender, so gender cannot be classified using data from that paper. However, the mean and standard deviation of lifetime incomes (defined here as cumulative income after 50 years) for men and women are present in the second paper referenced, "Education and Lifetime Earnings in the United States". The probability that an individual is a man, given that they are in a certain income bracket, can be calculated using these values.

This is done by taking the lifetime income values for men and women, getting the means and standard deviations for men, women and the total (men and women together), generating normal distributions using these values and getting the proportion of men who are in each bracket (each bracket being delineated by the percentile range it covers in the distribution).

This is divided by the number of men and women in that bracket to get the probability of being a man, given that you're in that bracket.

This is done using the following code:
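
A minimal sketch of the calculation using the standard-library statistics.NormalDist class is given below. The means and standard deviations for men and women are hypothetical placeholders, not the values from the paper, and the sketch assumes equal numbers of men and women overall.

```python
from statistics import NormalDist

# Placeholder parameters in US dollars (NOT the values from the paper)
men = NormalDist(mu=1_300_000, sigma=700_000)
women = NormalDist(mu=700_000, sigma=500_000)
total = NormalDist(mu=980_000, sigma=738_000)

# Bracket boundaries at the 2nd, 16th, 50th, 84th and 98th percentiles of the total
edges = [total.inv_cdf(p) for p in (0.02, 0.16, 0.50, 0.84, 0.98)]

def bracket_shares(dist, edges):
    """Share of a distribution falling in each bracket, lowest bracket first."""
    cdf_values = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    return [high - low for low, high in zip(cdf_values, cdf_values[1:])]

share_men = bracket_shares(men, edges)
share_women = bracket_shares(women, edges)

# Assumption: equal numbers of men and women, so the shares can be compared directly
p_male = [m / (m + w) for m, w in zip(share_men, share_women)]
print([round(p, 3) for p in p_male])
```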

These values are used to generate random variables using the binomial distribution with 1 trial for each income bracket (this is technically the Bernoulli distribution 14). These variables are appended to a list of genders. They are either 1 (Male) or 0 (Female).

The function to generate this list is as follows:
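
A sketch of such a function is given below. The function name generate_genders, the bracket labels "C1" to "C6" and the probabilities are hypothetical and serve only to illustrate the Bernoulli draw.

```python
import numpy as np

# Hypothetical probability of being male for each income bracket
p_male_by_class = {"C1": 0.40, "C2": 0.35, "C3": 0.40, "C4": 0.55, "C5": 0.80, "C6": 0.95}

def generate_genders(income_classes, p_male_by_class, rng):
    """Return a list of 1 (male) or 0 (female), one Bernoulli draw per row."""
    return [int(rng.binomial(1, p_male_by_class[c])) for c in income_classes]

rng = np.random.default_rng(seed=0)  # arbitrary seed
income_classes = ["C3", "C4", "C1", "C6", "C4", "C2"]  # one bracket label per row
genders = generate_genders(income_classes, p_male_by_class, rng)

# Proportion of rows drawn as male
proportion_male = sum(genders) / len(genders)
print(genders, proportion_male)
```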

The proportion of men (i.e. the number of 1's) generated by this method is 0.518 or 51.8%, which is approximately what one would expect in the general population (50% men, 50% women).

The list returned by this function is assigned to a column called 'Gender' in the dataframe df:

The dataframe column Gender is made into a categorical variable with values either being 'M' or 'F' based on male being 1 and female being 0, as follows:
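
A minimal sketch of the assignment and the conversion, assuming a short hand-written list of generated values in place of the function output:

```python
import pandas as pd

df = pd.DataFrame({"Income": [1.2e6, 4.0e5, 9.0e5, 2.1e6, 1.5e6]})

# Assign the generated 1/0 values to the Gender column
df["Gender"] = [1, 0, 0, 1, 1]

# Map 1 to 'M' and 0 to 'F', then store the result as a categorical variable
df["Gender"] = df["Gender"].map({1: "M", 0: "F"}).astype("category")
print(df["Gender"].tolist())  # ['M', 'F', 'F', 'M', 'M']
```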

3.7 Details on dataset

The means of the columns of df and standard deviations of the columns of df are given as follows:

The first 5 rows are given as follows:

The correlation matrix for this dataset is:
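
These summaries can be produced with standard pandas methods, as sketched below on a small simulated dataframe. The data and column names are placeholders; select_dtypes is used so that categorical columns such as Gender are excluded from the numeric summaries.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed
df = pd.DataFrame({
    "Income": np.random.normal(980_000, 738_000, size=100),
    "Conscientiousness": np.random.normal(2.56, 0.48, size=100),
    "Gender": np.random.choice(["M", "F"], size=100),
})

numeric = df.select_dtypes("number")  # numeric columns only

column_means = numeric.mean()   # mean of each column
column_stds = numeric.std()     # standard deviation of each column
first_rows = df.head()          # first 5 rows
corr_matrix = numeric.corr()    # pairwise Pearson correlations

print(column_means, column_stds, first_rows, corr_matrix, sep="\n\n")
```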

3.8 Visualisation

A pairplot of the dataset is given as follows (marked by Income Class)

A plot divided using Gender is given as follows:

A plot divided using Education Level is as follows:

3.9 Output dataframe to csv file

The dataframe can be output to a csv file using the to_csv function from the pandas package:
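
A minimal sketch, assuming a small placeholder dataframe and a hypothetical file name:

```python
import pandas as pd

df = pd.DataFrame({
    "Income": [1_200_000.0, 400_000.0],
    "Gender": ["M", "F"],
})

# Hypothetical output file name; index=False leaves out the row index column
output_path = "simulated_income.csv"
df.to_csv(output_path, index=False)
```

The file can be read back with pd.read_csv(output_path) to confirm that the columns and row count match the dataframe.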

4 References

[1] Duckworth, A., Weir, D., Tsukayama, E. and Kwok, D., 2012. Who Does Well in Life? Conscientious Adults Excel in Both Objective and Subjective Success. Frontiers in Psychology, 3.
[2] Doi.apa.org. 2020. APA Psycnet. [online] Available at: https://doi.apa.org/doiLanding?doi=10.1037%2F0003-066X.51.2.77 [Accessed 26 November 2020].
[3] Psychology.emory.edu. 2020. Interval. [online] Available at: http://www.psychology.emory.edu/clinical/bliwise/Tutorials/SOM/smmod/scalemea/print2.htm [Accessed 26 November 2020].
[4] Psychology Today. 2020. Big 5 Personality Traits. [online] Available at: https://www.psychologytoday.com/ie/basics/big-5-personality-traits [Accessed 26 November 2020].
[5] Reflectd. 2020. A Look Into Personality And The Big Five Personality Traits. [online] Available at: https://reflectd.co/2013/03/22/what-is-personality-does-it-change/ [Accessed 26 November 2020].
[6] Ortiz-Ospina, E., Tzvetkova, S. and Roser, M., 2020. Women's Employment. [online] Our World in Data. Available at: https://ourworldindata.org/female-labor-supply [Accessed 11 December 2020].
[7] Tamborini, C., Kim, C. and Sakamoto, A., 2015. Education and Lifetime Earnings in the United States. Demography, 52(4), pp.1383-1407.
[8] Research.stlouisfed.org. 2020. Education Income And Wealth | St. Louis Fed. [online] Available at: https://research.stlouisfed.org/publications/page1-econ/2017/01/03/education-income-and-wealth/ [Accessed 18 December 2020].
[9] Noftle, E. and Robins, R., 2007. Personality predictors of academic outcomes: Big five correlates of GPA and SAT scores. Journal of Personality and Social Psychology, 93(1), pp.116-130.
[10] Bergold, S. and Steinmayr, R., 2018. Personality and Intelligence Interact in the Prediction of Academic Achievement. Journal of Intelligence, 6(2), p.27.
[11] Numpy.org. 2020. Numpy.Random.Multivariate_Normal — Numpy V1.19 Manual. [online] Available at: https://numpy.org/doc/stable/reference/random/generated/numpy.random.multivariate_normal.html [Accessed 26 November 2020].
[12] list, P., Nanda, S. and López, R., 2020. Pretty Print 2D Python List. [online] Stack Overflow. Available at: https://stackoverflow.com/questions/13214809/pretty-print-2d-python-list [Accessed 26 November 2020].
[13] Medium. 2020. Let Us Understand The Correlation Matrix And Covariance Matrix. [online] Available at: https://towardsdatascience.com/let-us-understand-the-correlation-matrix-and-covariance-matrix-d42e6b643c22 [Accessed 16 December 2020].
[14] Unf.edu. 2020. [online] Available at: https://www.unf.edu/~cwinton/html/cop4300/s09/class.notes/DiscreteDist.pdf [Accessed 11 December 2020].